{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div><font face=\"Times New Roman\" size=7><br><br>\n",
    "<center>\n",
    "Linear Regression and Regularization\n",
    "<center><br></div>\n",
    " \n",
    "\n",
    "### Machine Learning for Bioinformatics: Homework 1 (Practical)\n",
    "*Refer to (preferably)Quera or Amir Soleimanifar for any questions you have or other inconveniences*  \n",
    "*Telegram ID: @amirsoleix*  \n",
    "*Email: asoleix@gmail.com*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "<font face=\"Arial\" size=4><br>\n",
    "We want to train a model which estimates obesity levels based on eating habits and physical conditions of an individual. For our purpose, we will use a dataset of individuals from the countries of Mexico, Peru and Columbia.  \n",
    "The dataset was collected by Fabio Mendoza Palechor and ALexis de la Hoz Manotas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Description \n",
    "<font face=\"Arial\" size=4><br>\n",
    "This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records, the records are labeled with the class variable NObesity (Obesity Level), that allows classification of the data using the values of Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, 23% of the data was collected directly from users through a web platform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practical Phase \n",
    "<font face=\"Arial\" size=4><br>\n",
    "You need to complete each section by writing the relevant code, running and assessing the results. Feel free to add new code or markdown cells. After all snippets have been completed, save the results and upload the Jupyter notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import all libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2111, 17)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Gender</th>\n",
       "      <th>Age</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>family_history_with_overweight</th>\n",
       "      <th>FAVC</th>\n",
       "      <th>FCVC</th>\n",
       "      <th>NCP</th>\n",
       "      <th>CAEC</th>\n",
       "      <th>SMOKE</th>\n",
       "      <th>CH2O</th>\n",
       "      <th>SCC</th>\n",
       "      <th>FAF</th>\n",
       "      <th>TUE</th>\n",
       "      <th>CALC</th>\n",
       "      <th>MTRANS</th>\n",
       "      <th>BMI</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Female</td>\n",
       "      <td>21.0</td>\n",
       "      <td>1.62</td>\n",
       "      <td>64.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>no</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>no</td>\n",
       "      <td>Public_Transportation</td>\n",
       "      <td>18.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Female</td>\n",
       "      <td>21.0</td>\n",
       "      <td>1.52</td>\n",
       "      <td>56.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>yes</td>\n",
       "      <td>3.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>Public_Transportation</td>\n",
       "      <td>22.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Male</td>\n",
       "      <td>23.0</td>\n",
       "      <td>1.80</td>\n",
       "      <td>77.0</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Frequently</td>\n",
       "      <td>Public_Transportation</td>\n",
       "      <td>21.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>1.80</td>\n",
       "      <td>87.0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Frequently</td>\n",
       "      <td>Walking</td>\n",
       "      <td>28.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1.78</td>\n",
       "      <td>89.8</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>no</td>\n",
       "      <td>2.0</td>\n",
       "      <td>no</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Sometimes</td>\n",
       "      <td>Public_Transportation</td>\n",
       "      <td>26.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Gender   Age  Height  Weight family_history_with_overweight FAVC  FCVC  \\\n",
       "0  Female  21.0    1.62    64.0                            yes   no   2.0   \n",
       "1  Female  21.0    1.52    56.0                            yes   no   3.0   \n",
       "2    Male  23.0    1.80    77.0                            yes   no   2.0   \n",
       "3    Male  27.0    1.80    87.0                             no   no   3.0   \n",
       "4    Male  22.0    1.78    89.8                             no   no   2.0   \n",
       "\n",
       "   NCP       CAEC SMOKE  CH2O  SCC  FAF  TUE        CALC  \\\n",
       "0  3.0  Sometimes    no   2.0   no  0.0  1.0          no   \n",
       "1  3.0  Sometimes   yes   3.0  yes  3.0  0.0   Sometimes   \n",
       "2  3.0  Sometimes    no   2.0   no  2.0  1.0  Frequently   \n",
       "3  3.0  Sometimes    no   2.0   no  2.0  0.0  Frequently   \n",
       "4  1.0  Sometimes    no   2.0   no  0.0  0.0   Sometimes   \n",
       "\n",
       "                  MTRANS   BMI  \n",
       "0  Public_Transportation  18.9  \n",
       "1  Public_Transportation  22.7  \n",
       "2  Public_Transportation  21.6  \n",
       "3                Walking  28.0  \n",
       "4  Public_Transportation  26.6  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read the dataset\n",
    "df = pd.read_csv('./dataset_bmi.csv')\n",
    "print(df.shape)\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split the Data  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "Split the data to training (80 percent) and test (20 percent) sets using Stratified Sampling on `BMI` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "Analyze the data and find information about different attributes. Requested items are:\n",
    "1. Number of categories and distribution of them in either plots or numbers\n",
    "2. Mean, std, and quartiles of numerical attributes\n",
    "3. Check for existence of NaN or empty rows  \n",
    "\n",
    "Visualize the dataset in convenient way and measure the correlation between different columns of the matrix using `corr()` command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Cleaning and Manipulation  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "Transform all categorical attributes to numerical attributes using ordinal encoders if the attribute is ordinal or one-hot encoders if the attribute is nominal.   \n",
    "\n",
    "Add at least two attributes to the dataset using information of other columns. Explain the reasons you think the added columns are better indicators of the data.  \n",
    "Scale the data and build a pipeline to be used later for test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Principal Component Analysis  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "Use `sklearn.decomposition.PCA` to reduce the dimension of dataset to a convenient number. Plot the scree plot for the final solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Training  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "Train the <code>linear regression</code> model and its regularized forms (<code>ridge</code> and <code>lasso</code>) on your training data. Cross-validate the models using <code>10 fold CV</code> and report the accuracy scores. You are allowed to use <code>sklearn.linear_model</code> for your implementation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ridge Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lasso Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final Report  \n",
    "<font face=\"Arial\" size=4><br>\n",
    "After training the data, use your pipeline previously created to transform the test data to decent form and then run your final model and report the accuracy score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"Arial\" size=4><br>\n",
    "Discuss why you think this model worked best for the selected datasets and mention 3 areas where extra effort can be put into work to enhance the accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  },
  "vscode": {
   "interpreter": {
    "hash": "ce1b09b10ddf3c0c2840855cfede39e38bc763229e16b97ef7234a53f605a92b"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
